Reading several files: alignment (e.g., YEAST case)

In this tutorial, we show how to read several data sets and save the results in a single file.

In the YEAST case, we have 36 experiments stored in 6 files called alpha0, alpha1, alpha5, alpha10, alpha20 and alpha45.

We first want to read all of them and build a single dataframe. This can be done using the class MassSpecAlignmentYeast.


In [1]:
%pylab inline
from msdas import *
from msdas import yeast


Populating the interactive namespace from numpy and matplotlib
Couldn't import dot_parser, loading of dot files will not be possible.

By default, when you read a file called alpha0, the columns holding measurements are renamed with the filename as a prefix. For example, a column called t0 is renamed alpha0_t0. This avoids clashes between identical names across files (t0 may appear in every file). If you prefer specific prefixes, you can provide them as in the following examples.


In [2]:
filenames = yeast.get_yeast_filenames()

In [3]:
import pandas as pd
df1 = pd.read_csv(filenames[0])
df2 = pd.read_csv(filenames[1])
df1.columns


Out[3]:
Index([u'Protein', u' Psite', u' Sequence', u' t0', u' t1', u' t5', u' t10',
       u' t20', u' t45'],
      dtype='object')
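
Note that every column name after the first carries a leading space (u' Psite', u' t0', ...), which suggests the files put a space after each comma. The merged dataframe shown later has clean names, so the reader evidently strips them; how it does so is not shown here. As a minimal sketch on toy data (the CSV string below is invented for illustration), two plain pandas ways of normalising such names are:

```python
import io
import pandas as pd

# Toy CSV mimicking the header style seen above (a space after each comma).
raw = "Protein, Psite, Sequence, t0, t1\nDIG1, S142, SAPAQVTQHSK, 0.1, 0.2\n"

# Default parsing keeps the leading space in the column names.
df_raw = pd.read_csv(io.StringIO(raw))

# Option 1: ask the parser to skip whitespace that follows each delimiter.
df_clean = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Option 2: strip the names after loading (cell values are left untouched).
stripped_names = list(df_raw.columns.str.strip())

print(list(df_raw.columns))
print(list(df_clean.columns))
```

The first option also strips the leading space from the cell values, which is usually what you want for text columns such as Psite.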

In [4]:
df2.columns


Out[4]:
Index([u'Protein', u' Psite', u' Sequence', u' t0', u' t1', u' t5', u' t10',
       u' t20', u' t45'],
      dtype='object')
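
Both files expose exactly the same measurement columns, which is why a prefix is needed before merging. The renaming rule described earlier can be sketched in plain pandas as follows. This is only an illustration on toy data: the helper add_prefix_to_measurements and its id_columns argument are hypothetical names, not part of msdas.

```python
import pandas as pd


def add_prefix_to_measurements(df, prefix, id_columns=("Protein", "Psite", "Sequence")):
    """Return a copy of df whose non-identifier columns are renamed '<prefix>_<name>'.

    Hypothetical helper written for this example; msdas may do this differently.
    """
    mapping = {
        col: "{}_{}".format(prefix, col)
        for col in df.columns
        if col not in id_columns
    }
    return df.rename(columns=mapping)


# Toy data with the same layout as the yeast files (values are made up).
toy = pd.DataFrame({
    "Protein": ["DIG1"],
    "Psite": ["S142"],
    "Sequence": ["SAPAQVTQHSK"],
    "t0": [0.1],
    "t1": [0.2],
})

renamed = add_prefix_to_measurements(toy, "a0")
print(list(renamed.columns))
```

Identifier columns keep their names so that they can later serve as merge keys, while t0 and t1 become a0_t0 and a0_t1.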

In [5]:
m = MassSpecAlignmentYeast(filenames, prefixes=["a0", "a1", "a5", "a10", "a20", "a45"], verbose=False)

We have merged the 6 yeast data sets. The data is now available as a dataframe in m.df.


In [6]:
m.df.loc[0:3]


Out[6]:
   Protein  Sequence               Psite      Sequence_Phospho                     a0_t0       a0_t1       a0_t5       a0_t10      a0_t20      a0_t45      ...  a20_t5    a20_t10   a20_t20   a20_t45   a45_t0      a45_t1     a45_t5      a45_t10     a45_t20     a45_t45
0  DIG1     DGNLASSNSAHFPPVANQNVK  S126+S127  DGNLAS(Phospho)SNSAHFPPVANQNVK       0.00041509  0.00039651  0.0006711   0.00060249  0.00043997  0.00041787  ...  0.001149  0.000917  0.000902  0.001009  0.00028876  0.0003     0.00027013  0.00035849  0.00036712  0.00031307
1  DIG1     SAPAQVTQHSK            S142       S(Phospho)APAQVTQHSK                 0.00018739  0.00018479  0.0002666   0.00020245  0.00013835  0.00022575  ...  0.001135  0.000899  0.001064  0.001241  0.0011441   0.0013638  0.0012374   0.001091    0.0014252   0.001707
2  DIG1     VNDSYDSPLSGTASTGK      S272       VNDSYDS(Phospho)PLSGTASTGK           0.00033752  0.0003301   0.00053798  0.00050547  0.00038083  0.00032833  ...  0.000349  0.000319  0.000314  0.000232  0.0001779   0.000208   0.000122    0.00021177  0.00020337  0.0002206
3  DIG1     VNDSYDSPLSGTASTGK      S272^S275  VNDSYDS(Phospho)PLS(Phospho)GTASTGK  4.23e-05    4.7e-05     7.84e-05    4.92e-05    4.16e-05    3.58e-05    ...  0.000104  0.000075  0.000063  0.000061  6.4e-05     6.61e-05   7.44e-05    6.63e-05    6.18e-05    5.05e-05

4 rows × 40 columns


In [7]:
m.df.columns


Out[7]:
Index([u'Protein', u'Sequence', u'Psite', u'Sequence_Phospho', u'a0_t0',
       u'a0_t1', u'a0_t5', u'a0_t10', u'a0_t20', u'a0_t45', u'a1_t0', u'a1_t1',
       u'a1_t5', u'a1_t10', u'a1_t20', u'a1_t45', u'a5_t0', u'a5_t1', u'a5_t5',
       u'a5_t10', u'a5_t20', u'a5_t45', u'a10_t0', u'a10_t1', u'a10_t5',
       u'a10_t10', u'a10_t20', u'a10_t45', u'a20_t0', u'a20_t1', u'a20_t5',
       u'a20_t10', u'a20_t20', u'a20_t45', u'a45_t0', u'a45_t1', u'a45_t5',
       u'a45_t10', u'a45_t20', u'a45_t45'],
      dtype='object')

In [8]:
m.df.shape


Out[8]:
(57, 40)
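
The 40 columns are the 4 identifier columns (Protein, Sequence, Psite, Sequence_Phospho) plus 6 files times 6 time points. How MassSpecAlignmentYeast combines the rows internally is not shown in this tutorial. As a rough sketch of the general idea, assuming an outer join on the identifier columns (an assumption, not the documented msdas behaviour), two prefixed toy tables can be aligned like this:

```python
import pandas as pd

# Column count of the merged yeast dataframe: 4 identifiers + 6 files x 6 time points.
n_columns = 4 + 6 * 6

# Identifier columns used as merge keys in this toy example.
keys = ["Protein", "Sequence", "Psite"]

# Two toy tables whose measurement columns were already prefixed (values are made up).
a0 = pd.DataFrame({
    "Protein": ["DIG1", "DIG1"],
    "Sequence": ["SAPAQVTQHSK", "VNDSYDSPLSGTASTGK"],
    "Psite": ["S142", "S272"],
    "a0_t0": [0.1, 0.3],
})
a1 = pd.DataFrame({
    "Protein": ["DIG1"],
    "Sequence": ["SAPAQVTQHSK"],
    "Psite": ["S142"],
    "a1_t0": [0.5],
})

# An outer join keeps peptides seen in only one file; missing measurements become NaN.
merged = pd.merge(a0, a1, on=keys, how="outer")

print(n_columns)
print(merged.shape)
```

With an outer join, a peptide measured in only some experiments still gets a row, and the gaps show up as missing values rather than being silently dropped.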

In [9]:
from easydev import TempFile

# Wrap the aligned data in a reader, then export it to CSV
r = readers.MassSpecReader(m)
f = TempFile()  # a temporary named file
r.to_csv(f.name)
f.delete()  # remove the temporary file


INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost
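
The cell above writes the merged data to a temporary file only to demonstrate the export. The same write-then-clean-up pattern can be sketched with the standard library and plain pandas. This is an illustrative stand-in on toy data, using tempfile instead of easydev.TempFile and DataFrame.to_csv instead of the reader's own to_csv:

```python
import os
import tempfile

import pandas as pd

# Toy dataframe standing in for the merged data (values are made up).
df = pd.DataFrame({
    "Protein": ["DIG1"],
    "Psite": ["S142"],
    "a0_t0": [0.00018739],
})

# delete=False lets us close the handle and reopen the file by name on any platform.
tmp = tempfile.NamedTemporaryFile(suffix=".csv", delete=False)
tmp.close()
try:
    df.to_csv(tmp.name, index=False)    # write without the row index
    reloaded = pd.read_csv(tmp.name)    # read it back to check the round trip
finally:
    os.remove(tmp.name)                 # always clean up the temporary file

print(reloaded.shape)
```

In real use you would pass a permanent filename instead of a temporary one, so that the single merged file can be reloaded later.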

In [10]:
r.plot_phospho_stats()


